1 Preliminaries

1.1 Introduction

The main purpose of this notebook is to replicate the analysis of Neumann & Evert (2021) on register variation across three varieties of English (Hong Kong, Jamaica, New Zealand) as represented in the respective components of the International Corpus of English (ICE). We carry out the same analysis on an extended data set covering nine ICE components (including GB and Ireland) in order to see how much results are changed by taking into account the additional six components. Since we build on the pre-processed and annotated version of ICE available from the University of Zurich, linguistic features for the three components from the original study had to be extracted anew with modified queries. In this sense our study also attempts to reproduce the original analysis from the same corpus data.

For these reasons, we will use the same approach and parameter settings as Neumann & Evert (2021). This includes, in particular, the exclusion of short texts in pre-processing (see prepare_data.Rmd) and the use of log-transformed z-scores for all features in order to reduce the impact of outliers (resulting from skewed sparse frequency distributions) on the normality assumptions of the GMA methodology. In the following, we will primarily focus on a comparison between multivariate analyses based on the original three components and those based on all nine components.
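To make the idea of log-transformed z-scores concrete, here is a minimal sketch on toy data. The signed logarithm sign(z) * log(|z| + 1) and the simulated frequencies are assumptions for illustration only; the transformation actually applied to the ICE features is defined in prepare_data.Rmd.

```r
# Toy data: a sparse, right-skewed relative frequency (hypothetical values)
set.seed(1)
freq <- rexp(500, rate = 2)

# Standardise to z-scores (mean 0, standard deviation 1)
z <- (freq - mean(freq)) / sd(freq)

# Signed log transformation (assumed form): compresses large |z|,
# but keeps the sign and the ordering of the values
signed.log <- function(z) sign(z) * log1p(abs(z))
zl <- signed.log(z)

# Outliers are pulled towards the bulk of the distribution
range(z)
range(zl)
```

Because the transformation is strictly monotonic, it changes how far outliers lie from the centre without changing the ranking of texts on a feature.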

In this replication study, we use a recent object-oriented implementation of GMA made available in the R package gmatools. As this package has been written by the second author of Neumann & Evert (2021), the algorithms should be identical (or at least correspond closely) to those in their reproduction materials. The R package contains some additional functionality, though, which turned out to be useful for our replication study. As the package is still in an early experimental stage, it has to be installed directly from GitHub, using the devtools package. The code cell below has to be executed manually for safety reasons. It will only install the package if it is not already available in the system.

if (!requireNamespace("gmatools", quietly=TRUE)) {
  devtools::install_github("schtepf/GMA/pkg/gmatools")
}

Now we can load the gmatools package. All other R packages required by this notebook have already been loaded quietly and are not shown in the output.

library(gmatools)

As the documentation included in the gmatools package is somewhat incomplete and there is no user-friendly tutorial yet, the present notebook also gives some explanations on what different functions and methods do.

1.2 The extended ICE data set

Load the preprocessed data set.

var.names <- load("ice_preprocessed.rda")
## Meta, rand.idx, Features, M, Z, ZL, types.variety, types.shortvar, types.mode, types.format, types.textcat32, types.short32, types.code32, types.textcat20, types.short20, types.code20, types.textcat12, types.short12, types.code12, rainbow.32, rainbow.20, rainbow.12, feature.names

All metadata variables are already coded as factors with a sensible ordering of categories, so no further pre-processing is required here. The data set also includes rainbow colours for text categories and readable feature names. There are 7930 texts and 41 features. See prepare_data.Rmd for details about the distribution of metadata categories and text lengths.

For our reproduction study (and for the comparative analysis in the replication), we will often want to work on a subset of the data set comprising only the three components from the original study (ICE-HK, ICE-JAM, ICE-NZ). We prepare a separate feature matrix and metadata table for this subset. We also add a new column to the metadata table indicating which texts belong to the old and new data set.

Meta[, subset := factor(
  ifelse(Meta$shortvar %in% qw("HK JAM NZ"), "old", "new"), levels=qw("old new"))]
idx3 <- which(Meta$subset == "old")
ZL3 <- ZL[idx3, ]
Meta3 <- droplevels(Meta[idx3, ])
rand.idx3 <- na.omit(match(rand.idx, idx3)) # adjust rand.idx to subset
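The index adjustment in the last line is easy to get wrong, so here is a small sketch of what it does. We assume that rand.idx is a random ordering of row numbers of the full data set; the toy indices below are hypothetical.

```r
# Hypothetical toy indices: rows 2, 5 and 7 of the full table form the subset
idx.sub <- c(2, 5, 7)
# A random ordering of all 8 rows of the full table
rand.full <- c(7, 1, 5, 3, 8, 2, 4, 6)

# match() gives the position of each full-table row within the subset (NA if absent);
# na.omit() drops rows outside the subset while keeping the random order
rand.sub <- as.vector(na.omit(match(rand.full, idx.sub)))
rand.sub

# Mapping back confirms that the subset rows appear in the same random order
idx.sub[rand.sub]
```

The result is a random ordering of the subset rows that is consistent with the ordering used for the full data set.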

We refer to this subset as ICE3 and to the complete data set as ICE9.

1.3 Dimensions of variation

To get an overview of the main dimensions of linguistic variation in our data set, we carry out an unsupervised PCA. This is only done for the ICE3 data set: the replication for ICE9 will concentrate on the main analysis using weakly-supervised LDA dimensions.

The gmatools implementation of GMA is based on R6 objects of class GMA. A GMA object is initialised with the data set to be analysed and automatically carries out a PCA of the data set. We can obtain the PCA dimensions using the projection() method on the full GMA space (i.e. it returns the coordinates of the data set in PCA dimensions).

PCA <- GMA$new(ZL3)
ZL3.pca <- PCA$projection(space="both")
dim(ZL3.pca)
## [1] 2828   41
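For readers unfamiliar with this view of PCA, the following sketch uses base R's prcomp() on toy data (not the GMA class) to show that the full set of PCA dimensions is just a rotated coordinate system: all pairwise distances between data points are preserved.

```r
set.seed(42)
X <- matrix(rnorm(200 * 5), nrow = 200)          # toy data: 200 points, 5 features
X <- X %*% diag(c(3, 2, 1, 0.5, 0.2))            # give the features different spreads

pca <- prcomp(X, center = TRUE, scale. = FALSE)  # base-R PCA
X.pca <- pca$x                                   # coordinates in all 5 PCA dimensions

# The full set of PCA axes forms an orthonormal basis ...
round(crossprod(pca$rotation), 10)
# ... so pairwise distances between data points are unchanged
all.equal(as.vector(dist(X)), as.vector(dist(X.pca)))
```

Information is only lost once we restrict the view to the first few dimensions, which is why the choice of dimensions matters for visualisation.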

GMA’s main approach to the visualisation of multi-dimensional data sets is the scatterplot matrix, which works well for 3 to ca. 7 dimensions. Here we show the first 4 PCA dimensions, i.e. an orthogonal, non-distorting perspective on the geometric configuration of the data set in the original feature space, which captures as much distance information (i.e. linguistic variation) as possible.

The GMA tools provide a utility function gma.pairs() (a modification of the standard pairs() plot), which creates a compact display and makes it easy to highlight metadata categories in the plot. Note that we always save plots as PDF files for use in the associated journal paper (even though some might not be used in the end).

gma.pairs(ZL3.pca, 1:4, Meta=Meta3, col=textcat20, pch=variety, 
          pch.vals=1:9, col.vals=rainbow.20, 
          cex=.4, legend.cex=.7, iso=TRUE, compact=TRUE)

save.pdf("ice3_pca4_type_pairs.pdf")

A large part of the linguistic variation captured by the first 4 PCA dimensions seems to be connected to register variation. Regions of this subspace correlate to a certain degree with ICE text categories, though the separation of text categories is far from perfect. The fourth PCA dimension shows little connection to the text categories – if anything it appears to capture some of the outlier texts in the data set. Together, the 4 PCA dimensions account for 51.25% of the variance.

1.4 Scatterplot rows

Neumann & Evert (2021, Fig. 1) use a special type of scatterplot matrix which only plots the first dimension against all other dimensions, but creates two rows for written and spoken mode. For direct comparison and readability in a published paper, this visualisation style seems to be most appropriate. We define a highly specialised function for this purpose, which has been extended with a few configuration options. In particular:

  • col specifies the metadata variable used to determine the colour of data points. Its default (20 mid-level text categories) can be changed, but col.vals will need to be adjusted accordingly.
  • rows specifies the metadata variable used to split the visualisation into rows; neither col nor rows may be disabled.
  • optional pch specifies a metadata variable used to determine the plot symbols for data points
  • optional lim sets user-specified axis limits, either as a vector of length 2 (used for all dimensions) or as a two-column matrix specifying axis limits for each dimension in dims. Carefully chosen axis limits are often used to obtain an isometric visualisation (which cannot be guaranteed automatically).
  • optional cols specifies an additional metadata variable used to split the visualisation into columns. In this case, each panel shows the same two dimensions and dims must have length 2.
  • optional select plots only a subset of the data set based on a metadata constraint (evaluated in Meta)
  • grid=TRUE plots a grid in the background at integer coordinates (with slightly thicker lines at 0)

scatterplot.rows <- function (M, dims, Meta, select=NULL,
                              col="textcat20", pch=NULL, rows="mode", cols=NULL,
                              col.vals=rainbow.20, pch.vals=1:10, grid=FALSE,
                              cex=.8, legend.cex=1.5*cex, randomize=TRUE, lim=NULL, dim.string="Dim %d", ...) {
  nR <- nrow(M)
  n.dim <- length(dims)
  stopifnot(n.dim >= 2)
  if (!all(dims %in% seq_len(ncol(M)))) stop("invalid dimensions selected")
  
  if (nrow(Meta) != nR) stop("metadata table Meta= doesn't match data matrix M=")
  select.expr <- substitute(select)
  select <- eval(select.expr, Meta, parent.frame())
  if (!is.null(select)) {
    if (!is.logical(select)) stop("select= must be a Boolean expression selecting the desired items")
    M <- M[select, , drop=FALSE]
    Meta <- Meta[select, , drop=FALSE]
    nR <- nrow(M)
  }
  
  if (!is.null(lim)) {
    if (is.matrix(lim)) {
      if (nrow(lim) != n.dim || ncol(lim) != 2) stop(sprintf("lim= must be a %d x 2 matrix or a vector c(min, max)", n.dim))
    }
    else {
      if (length(lim) != 2) stop(sprintf("lim= must be a %d x 2 matrix or a vector c(min, max)", n.dim))
      lim <- cbind(rep(lim[1], n.dim), rep(lim[2], n.dim))
    }
  }
  else {
    lim <- t(apply(M[, dims, drop=FALSE], 2, expand.range, by=.05))
  }
  
  if (randomize) {
    if (is.numeric(randomize)) set.seed(randomize)
    idx <- sample.int(nR)
    M <- M[idx, , drop=FALSE]
    Meta <- Meta[idx, , drop=FALSE]
  }
  
  if (!is.null(pch)) pch.vec <- pch.vals[ Meta[[pch]] ] else pch.vec <- rep(1, nrow(M))
  col.cat <- as.factor(Meta[[col]])
  col.levels <- levels(col.cat)
  col.vec <- col.vals[col.cat]
  
  plot.panel <- function (d, idx, xlab="", ylab="") {
    xlim <- lim[d, ]
    ylim <- lim[1, ]
    w <- c(0.01, 0.99) # 1% inset from border
    plot(0, 0, type="n", xlim=xlim, ylim=ylim,
         xlab="", ylab="", main="", xaxt="n", yaxt="n")
    if (grid) {
      abline(v=round(xlim[1]):round(xlim[2]), col="lightgrey")
      abline(h=round(ylim[1]):round(ylim[2]), col="lightgrey")
      abline(h=0, v=0, lwd=2, col="lightgrey")
    }
    points(M[idx, dims[d]], M[idx, dims[1]],
         pch=pch.vec[idx], col=col.vec[idx], cex=cex)
    text(mean(xlim), sum(ylim * w), xlab, cex=legend.cex, font=2)
    text(sum(xlim * rev(w)), mean(ylim), ylab, cex=legend.cex, srt=90, font=2)
  }
  
  rows.vec <- droplevels(as.factor(Meta[[rows]]))
  rows.levels <- levels(rows.vec)
  n.rows <- length(rows.levels)
  
  if (!is.null(cols)) {
    if (n.dim != 2) stop("dims= must select exactly 2 dimensions if cols= is specified")
    cols.vec <- droplevels(as.factor(Meta[[cols]]))
    cols.levels <- levels(cols.vec)
    n.cols <- length(cols.levels)
  }
  else {
    n.cols <- n.dim - 1
  }
  
  par(mfrow=c(n.rows, n.cols + 1), mar=c(0, 0, 0, 0)+.2)
  for (i in seq_len(n.rows)) {
    idx.row <- rows.vec == rows.levels[i]
    colvals.row <- unique(col.cat[idx.row])
    idx.levels <- col.levels %in% colvals.row
    
    if (!is.null(cols)) {
      for (j in seq_len(n.cols)) {
        xlab <- if (i == 1) cols.levels[j] else if (i == 2 && j == 1) sprintf(dim.string, dims[2]) else ""
        ylab <- if (j == 1) sprintf(dim.string, dims[1]) else ""
        idx.cell <- idx.row & (cols.vec == cols.levels[j])
        plot.panel(2, idx.cell, xlab=xlab, ylab=ylab)
      }
    } 
    else {
      for (j in 2:n.dim) {
        xlab <- if (i == 1) sprintf(dim.string, dims[j]) else ""
        ylab <- if (j == 2) sprintf(dim.string, dims[1]) else ""
        plot.panel(j, idx.row, xlab=xlab, ylab=ylab)
      }
    }
      
    plot(0, 0, type="n", ann=FALSE, bty="n", xaxt="n", yaxt="n")
    legend(0, 0, xjust=0.5, yjust=0.5, cex=legend.cex,
       title=rows.levels[i], bty="n",
       legend=col.levels[idx.levels], 
       fill=col.vals[idx.levels], border=col.vals[idx.levels])
  }
}

We can now create a version of the PCA plot that corresponds to the LDA visualisation of Neumann & Evert (2021).

scatterplot.rows(ZL3.pca, 1:4, Meta3, pch="variety", dim.string="PCA %d")

save.pdf("ice3_pca4_type.pdf", width=12, height=8)

2 The LDA space of text categories

The main goal of Neumann & Evert (2021) was to study the interaction between language varieties and register variation. In order to draw meaningful conclusions, we need a clearly interpretable and well-structured register space. While we might try to interpret the first 3 PCA dimensions as dimensions of register variation (in a Biberian approach), coming up with clear and empirically well-founded interpretations can be challenging. Moreover, we cannot be sure that these dimensions primarily capture register variation rather than other aspects such as individual stylistic choices. Finally, the PCA space lacks visual structure: the data set is a nearly spherical blob structured only by colour-coding text categories. If there is indeed structure in the geometric configuration of the data set – a fundamental assumption of GMA – the PCA fails to recover it.

This is where the weakly-supervised intervention central to GMA comes in. Following Neumann & Evert (2021), we use supervised LDA (linear discriminant analysis) to create a register space based on the ICE text categories. The crucial advantages are that the resulting latent dimensions focus on the aspects of register variation captured by text categories (minimising the impact of any other factors of linguistic variation), and that the well-separated text categories provide a visual map of the register space that helps us interpret our observations.

2.1 GMA objects

As has been pointed out before, a GMA object is initialised with a data set that determines the dimensionality of the original feature space and that is its main object of analysis (though we can use the GMA object with other data points as well). At its core, GMA decomposes the feature space into a focus space and its orthogonal complement. The dimensions of the focus space are usually determined by a weakly-supervised analysis of the data set, but can also be defined manually or copied from another GMA object. GMA objects use orthonormal basis vectors for both focus and complement space, in order to enable orthogonal, geometry-preserving projections. The basis vectors of the complement space are determined by a PCA of the internal data set (projected into the complement space), so that the first complement dimensions capture as much of the remaining variation as possible. When a GMA object is first initialised, its focus space is empty (0-dimensional), so the complement space contains a full PCA of the data set – a fact that we exploited in the previous section.

Throughout this notebook, we want to compare the LDA register space for the ICE3 subset (which should reproduce the study of Neumann & Evert 2021) with an LDA register space based on the complete ICE9 data set. For this purpose, we need two separate GMA objects initialised with the ICE3 and ICE9 data sets, respectively.

ICE3 <- GMA$new(ZL3)
print(ICE3)
## GMA object representing projection of 2828 x 41 data matrix into 0-dimensional subspace
ICE9 <- GMA$new(ZL)
print(ICE9)
## GMA object representing projection of 7930 x 41 data matrix into 0-dimensional subspace

One important thing to keep in mind is that the GMA tools use R6 classes, which have reference semantics, so that GMA objects are modified in place (in contrast to most other R objects such as data frames, with the notable exception of data.tables). For this reason, we will later need to clone our objects in order to compare different focus spaces for the same data set.
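The practical consequence of in-place modification can be seen in a small R6 sketch. The Space class below is a hypothetical stand-in, not part of gmatools; only R6Class() and the standard clone() method of R6 objects are used. Whether GMA objects need a deep clone depends on their internals, which we do not assume here.

```r
library(R6)  # gmatools builds on R6, so the package should already be installed

# Hypothetical toy class standing in for a GMA object
Space <- R6Class("Space", public = list(
  dims = 0,
  add = function(k) {
    self$dims <- self$dims + k   # modifies the object in place
    invisible(self)
  }
))

a <- Space$new()
b <- a              # NOT a copy: both names refer to the same object
b$add(4)
a$dims              # the change made through `b` is visible through `a`

a2 <- a$clone()     # independent copy
a2$add(2)
a$dims              # unchanged by the modification of the clone
a2$dims
```

Plain assignment therefore never protects an object from later modification; an explicit clone is required whenever two variants of the same analysis are to be kept side by side.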

2.2 Reproducing Neumann & Evert (2021)

Our first step is to reproduce the analysis of Neumann & Evert (2021) with our ICE3 data set (which is a recreation of their data). First, we perform an LDA based on ICE text categories (using the same intermediate-level 20-category system as in our visualisations).

lda.textcat <- ICE3$discriminant(Meta3$textcat20)
dim(lda.textcat)
## [1] 41 19

The LDA requires 19 dimensions for an optimal separation of the 20 text categories (in contrast to other LDA applications that sometimes achieve separation with many fewer dimensions than categories). Of course, reducing our 41-dimensional feature space to a 19-dimensional focus space is of little use. We will thus focus on the first few dimensions of the LDA instead.
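As background, an LDA can produce at most K − 1 discriminants for K categories (fewer if there are fewer features than that). A minimal sketch with MASS::lda() on the built-in iris data, used here purely as a stand-in for the ICE features, illustrates this and shows how unevenly the discriminants typically contribute.

```r
library(MASS)  # provides lda(); a recommended package shipped with R

# iris has 4 features and 3 classes, so LDA returns at most 3 - 1 = 2 discriminants
fit <- lda(Species ~ ., data = iris)
dim(fit$scaling)          # features x discriminants

# Proportion of between-class variance captured by each discriminant
round(fit$svd^2 / sum(fit$svd^2), 4)
```

In such cases most of the class separation is concentrated in the leading discriminants, which is the motivation for keeping only the first few dimensions.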

Neumann & Evert (2021: 155) settle on the first 4 dimensions, which allow a separation of the text categories with 60% accuracy (using an SVM classifier with 5-fold cross-validation), compared to 72.6% accuracy in all 19 dimensions. This shows that the reduction of the register space to 4 dimensions does not discard much structure and we should still be able to clearly make out regions corresponding to the different text categories.

For our reproduction, we simply follow the decision of Neumann & Evert (2021). We might later also apply SVM classifiers to different subsets of the LDA dimensions or determine pairwise discrimination of text categories.

We use the add() method to add the first four LDA dimensions to the focus space of the GMA object. Note that the LDA axis vectors are neither orthogonal nor normalised to unit length (since they actually represent discriminants rather than dimensions).

round(crossprod(lda.textcat[, 1:4]), 3)
##        LD1    LD2    LD3    LD4
## LD1  4.882 -3.350 -0.105  1.665
## LD2 -3.350 16.640  3.333 -4.508
## LD3 -0.105  3.333  9.368  0.058
## LD4  1.665 -4.508  0.058 13.031

The GMA object automatically determines an orthonormal basis of the new focus space such that the first \(k\) basis vectors span the same subspace as the first \(k\) LDA axis vectors.

ICE3$add(lda.textcat[, 1:4])
print(ICE3)
## GMA object representing projection of 2828 x 41 data matrix into 4-dimensional subspace
round(crossprod(ICE3$basis("focus")), 3)
##     LD1 LD2 LD3 LD4
## LD1   1   0   0   0
## LD2   0   1   0   0
## LD3   0   0   1   0
## LD4   0   0   0   1
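One standard way to obtain an orthonormal basis with this nesting property is a QR decomposition (equivalent to Gram-Schmidt orthogonalisation). The sketch below demonstrates the property on random toy vectors; that gmatools uses this particular technique internally is an assumption, not something we have verified.

```r
set.seed(7)
A <- matrix(rnorm(10 * 4), nrow = 10)   # toy stand-in for 4 non-orthogonal axis vectors
round(crossprod(A), 3)                  # not the identity matrix

Q <- qr.Q(qr(A))                        # orthonormal basis via QR decomposition
round(crossprod(Q), 3)                  # identity matrix

# Nested spans: projecting the first k columns of A onto the first k columns of Q
# leaves no residual, for every k
for (k in 1:4) {
  Ak <- A[, 1:k, drop = FALSE]
  Qk <- Q[, 1:k, drop = FALSE]
  residual <- Ak - Qk %*% crossprod(Qk, Ak)
  stopifnot(max(abs(residual)) < 1e-10)
}
```

The order of the input vectors matters: the first basis vector has the same direction as the first discriminant (up to sign), while later basis vectors only keep the part that is orthogonal to all earlier ones.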

Now we can visualise the ICE3 data set in the LDA register space as a scatterplot matrix. The coordinates to be plotted are the projections of the data points into the focus space of the GMA object.

ICE3.X <- ICE3$projection("focus")
gma.pairs(ICE3.X, 1:4, Meta=Meta3, col=textcat20, pch=variety, 
          col.vals=rainbow.20, 
          cex=.4, legend.cex=.7, iso=TRUE, compact=TRUE)

save.pdf("ice3_lda_raw_type_pairs.pdf")

This is indeed a nicely structured register space, assigning text categories to different regions and arranging related categories next to each other. It is thus very plausible as a linguistically interpretable basis space for our further analysis.

It is unfortunate that the striking double banana shape (dare I call it “phallic”?) doesn’t line up nicely with the dimensions of our focus space. For such situations, the gmatools package extends GMA with the option of performing rotations in (some dimensions of) the focus space. Unlike the “rotations” of factor analysis, these are always true rotations of the coordinate system, i.e. isometric linear maps. Here we apply a “varimax” style rotation to the first two dimensions (by performing a PCA in those two dimensions), aligning the bananas with the first dimension.

ICE3$rotation("pca", dim=1:2)
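The effect of such a PCA rotation within two dimensions can be sketched in base R on toy coordinates. This illustrates the general technique under our own assumptions (eigenvectors of the 2 × 2 covariance matrix); it is not the gmatools implementation.

```r
set.seed(3)
# Toy focus-space coordinates: 4 dimensions, with a diagonal trend in dimensions 1 and 2
X <- matrix(rnorm(500 * 4), nrow = 500)
X[, 2] <- 0.8 * X[, 1] + 0.3 * X[, 2]

# PCA restricted to dimensions 1:2: eigenvectors of their covariance matrix
R2d <- eigen(cov(X[, 1:2]))$vectors      # orthogonal 2 x 2 matrix
R <- diag(4)
R[1:2, 1:2] <- R2d                       # dimensions 3 and 4 are left untouched

X.rot <- X %*% R
round(cov(X.rot)[1, 2], 10)              # dimensions 1 and 2 are now uncorrelated
all.equal(as.vector(dist(X)), as.vector(dist(X.rot)))  # distances are preserved
```

Note that an eigenvector matrix may also include a reflection of an axis; this is still an isometry, but it can mirror the plot, so a visual check after rotating is advisable.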

Now the scatterplot matrix visually matches the results of Neumann & Evert (2021) very well, nicely reproducing their result.

ICE3.X <- ICE3$projection("focus")
gma.pairs(ICE3.X, 1:4, Meta=Meta3, col=textcat20, pch=variety, 
          col.vals=rainbow.20,
          cex=.4, legend.cex=.7, iso=TRUE, compact=TRUE)

save.pdf("ice3_lda_type_pairs.pdf")

We can now visualise scatterplot rows matching Neumann & Evert (2021, Fig. 1), using suitable fixed axis ranges for each dimension. We start from the ranges used in the original paper, but might need to adjust them in order to accommodate the ICE9 register space of the replication study. Since each panel has a 4:3 aspect ratio in the PDF plot, we have to choose suitable ranges to ensure an isometric display.

axis.lim <- matrix(c(-3.1, 2.5,  -2.0, 2.2,  -2.0, 2.2,  -2.0, 2.2),
                   ncol=2, byrow=TRUE)
scatterplot.rows(ICE3.X, 1:4, Meta3, pch="variety", pch.vals=c(1, 3, 4), lim=axis.lim)

save.pdf("ice3_lda_type.pdf", width=12, height=8)
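Whether these limits give an isometric display can be checked with a little arithmetic. The sketch below assumes the layout produced by scatterplot.rows() for this plot (2 rows for the two modes, 3 panels plus a legend column) and ignores the small panel margins.

```r
axis.lim <- matrix(c(-3.1, 2.5,  -2.0, 2.2,  -2.0, 2.2,  -2.0, 2.2),
                   ncol=2, byrow=TRUE)
span <- axis.lim[, 2] - axis.lim[, 1]     # data units covered by each dimension

# Layout of the PDF: 2 rows x 4 columns (3 panels + legend) on a 12 x 8 inch page
panel.width  <- 12 / 4                    # inches; horizontal axis = dimensions 2 to 4
panel.height <- 8 / 2                     # inches; vertical axis = dimension 1

units.x <- span[2:4] / panel.width        # data units per inch, horizontally
units.y <- span[1] / panel.height         # data units per inch, vertically
rbind(units.x, units.y)                   # equal values mean an isometric display
```

If the axis ranges are adjusted later, the same check shows how the range of the first dimension has to change together with the others in order to keep the display isometric.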

Finally, are there differences between language varieties in the register space, i.e. do registers differ between the three varieties? We investigate this question first by using different colours to highlight the three language varieties rather than text categories.

scatterplot.rows(ICE3.X, 1:4, Meta3, col="variety", col.vals=simple.pal, lim=axis.lim)

save.pdf("ice3_lda_var.pdf", width=12, height=8)

An alternative is to focus on the first two dimensions (showing the most interesting geometric structure) and put separate scatterplots for the three varieties side by side. Since we can now use colours to highlight text categories again, this gives a better picture of register-related divergences between the varieties.

scatterplot.rows(ICE3.X, 1:2, Meta3, pch="variety", cols="variety",
                 pch.vals=c(1, 3, 4), lim=axis.lim[1:2,], grid=TRUE)

save.pdf("ice3_lda_type_by_var.pdf", width=12, height=8)

In order to determine how much of the linguistic variation in our data set is captured by the four dimensions of our focus space, we can determine the proportion \(R^2\) of variance (= squared distance information) that is preserved in the orthogonal projection. The R2() method returns the percentage for each dimension of the focus space, and we can add some PCA dimensions from the complement space for comparison.

ICE3$R2(dim=1:8)
##       LD1       LD2       LD3       LD4       PC1       PC2       PC3       PC4 
## 13.309158  1.268158  1.935915  1.355516 18.472037  7.652326  6.238439  4.728298

The total \(R^2\) of the focus space is only 17.87%. Let us include three complement dimensions in the visualisation to add perspective.
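The idea behind \(R^2\) can be sketched in a few lines of base R: for an orthonormal basis, the share of the total sum of squares preserved along each basis vector adds up across dimensions. The toy data, the random basis and the use of column-centred data are assumptions for illustration; the last line simply re-adds the four focus-space values printed above.

```r
set.seed(11)
X <- matrix(rnorm(400 * 6), nrow = 400) %*% diag(c(3, 2, 1, 1, 0.5, 0.5))
Xc <- scale(X, center = TRUE, scale = FALSE)     # column-centred data (assumed)

# An arbitrary orthonormal basis of the full space;
# columns 1:2 play the role of a focus space
Q <- qr.Q(qr(matrix(rnorm(6 * 6), nrow = 6)))

# Share of the total sum of squares preserved along each basis vector, in percent
R2 <- 100 * colSums((Xc %*% Q)^2) / sum(Xc^2)
round(R2, 2)
sum(R2[1:2])    # R^2 of the 2-dimensional "focus space"
sum(R2)         # a complete orthonormal basis preserves everything

# Total R^2 of the ICE3 focus space from the values reported above (approx. 17.87)
sum(c(13.309158, 1.268158, 1.935915, 1.355516))
```

Because the contributions are additive, the values for focus and complement dimensions can be compared directly, as in the output above.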

tmp <- ICE3$projection("both")
gma.pairs(tmp, 1:7, Meta=Meta3, col=textcat20, pch=variety, 
          col.vals=rainbow.20, 
          cex=.2, legend.cex=.35, iso=TRUE, compact=TRUE)

save.pdf("ice3_lda_type_with_pca.pdf", width=12, height=9)

The rightmost three columns of the plot show the first PCA dimensions from the complement space. It is evident that PC1 is correlated with our first focus dimension, but also captures substantial amounts of variation within each text category. PC2 also helps to separate certain text categories, but provides a less clear-cut separation than the focus space dimensions overall. PC3 appears to capture a substantial amount of variation that is not directly related to text categories and might be connected with individual style or topic.

Neumann & Evert (2021) label the dimensions of the focus space as conceptual speaking vs. conceptual writing (LDA dim 1), dialogic written vs. neutral (LDA dim 2), descriptive-narrative vs. instructive-regulative (LDA dim 3), and neutral vs. online production (LDA dim 4). Their interpretation is based on the visual reference system created by the positions of different text categories within the focus space, combined with feature weights of the LDA dimensions (which are Biber’s main entry point for interpretation). The barplots below show feature weights for the four dimensions, given by the coordinates of the orthonormal basis vectors. The barplot only shows features \(i\) that have a substantial weight \(|p_{ij}| \geq .1\) in at least one dimension \(j\). Keep in mind that feature weights are relative within each basis vector (because \(\|\mathbf{p}_{\bullet j}\|_2 = 1\)); a discriminant characterised by consistently large values of many different features would assign relatively low weights to all of them.
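A short numerical sketch shows why unit-length normalisation makes weights relative. The two weight vectors are hypothetical and only meant to contrast a dimension carried evenly by all 41 features with one dominated by a few features.

```r
n.features <- 41

# A unit-length vector that spreads its weight evenly over all features
p.even <- rep(1 / sqrt(n.features), n.features)
sum(p.even^2)                 # squared Euclidean norm is 1
round(p.even[1], 3)           # each individual weight is small

# A unit-length vector dominated by 3 features
p.few <- c(rep(1, 3), rep(0.05, n.features - 3))
p.few <- p.few / sqrt(sum(p.few^2))
round(p.few[c(1, 4)], 3)      # large weights for the dominant features, tiny ones elsewhere

# Number of features passing the display threshold |p| >= .1 in each case
sum(abs(p.even) >= .1)
sum(abs(p.few) >= .1)
```

A moderate bar in the plots can therefore indicate either a moderately important feature or one of many features contributing equally, which is one reason why weights alone are a risky basis for interpretation.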

Since the original paper uses an adapted colour scale for each plot, leading to more saturated colours than our common scale for all four dimensions, we use the new zlim option to enforce a scale that looks sufficiently similar to those of all the original barplots to allow for easy direct comparison.

ICE3.P <- ICE3$basis("focus")
idx.weights <- apply(abs(ICE3.P), 1, max) >= .1 # only show features with substantial weight
gma.plot.weights(ICE3.P, dim=1:4, feature.names=feature.names, names=paste("LDA dim", 1:4), 
                 idx=idx.weights, ylim=c(-.75, .45), zlim=c(-.4, .4))

save.pdf("ice3_lda_weights.pdf", width=8, height=7)

These plots look quite similar to the ones shown in Neumann & Evert (2021), though there are some noticeable differences – our replication is close to the previous study, but not quite the same. Overall feature weights are distributed somewhat more equally, but some changes might lead to a subtly different linguistic interpretation of the dimensions.

Neumann & Evert (2021, Fig. 4) complement their interpretation by looking at the contribution of different features to the discriminant scores of text categories, which Evert & Neumann (2017) insist on to avoid misinterpretation of feature weights. The numerous comparisons of different categories for each LDA dimension are only made possible by an interactive Web app, so we do not include this step in our replication experiment.

3 Extending the corpus

We now extend the analysis to our full ICE corpus covering 9 language varieties. This can be done in two ways:

  1. Project texts from the additional 6 language varieties into the focus space determined on ICE3. This is a very limited form of replication, which tests only whether the observations made in the original study were specific to the 3 selected varieties or representative of more general patterns.
  2. Carry out an LDA on the full ICE9 corpus to determine a new focus space. This is a more challenging replication study as the LDA is carried out on an extended data set and might result in a substantially different latent space. It is still less demanding than a replication on an entirely different data set or on a different selection of linguistic features.

Depending on which approach we take, different comparisons will be of interest, such as

  • coordinates of the “new” varieties vs. those of the “old” varieties in the ICE3 focus space
  • coordinates of the “old” (or “new”) varieties in the ICE3 vs. ICE9 focus space
  • orthogonal basis dimensions of the ICE3 vs. ICE9 focus space

Since our main focus here is on differences between language varieties and on the effects of including additional varieties in the LDA analysis, we use colour coding to represent the ICE components in our first overview plots. Text categories are not highlighted at all at this point, but plot symbols differentiate between spoken and written language.

3.1 Adding texts to the ICE3 focus space

We can apply the ICE3 orthogonal projection also to new data, allowing us to obtain projection coordinates in the ICE3 focus space for all texts. The coordinates of ICE3 texts in the new projection should be identical to their original coordinates (ICE3.X).

ICE3.X9 <- ICE3$projection("focus", M=ICE9$data)
stopifnot(all.equal(ICE3.X, ICE3.X9[idx3, ]))
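The reason the coordinates must agree is that an orthogonal projection is a fixed linear map: once the basis has been determined, each text is projected independently of all others. A base-R sketch on toy data, assuming that the projection involves no data-dependent centring:

```r
set.seed(5)
X.all <- matrix(rnorm(100 * 6), nrow = 100)   # toy "full" data set
idx.sub <- 1:30                               # rows forming the original subset

# Orthonormal basis of a 2-dimensional subspace, derived from the subset only
Q <- prcomp(X.all[idx.sub, ], center = FALSE)$rotation[, 1:2]

proj.sub <- X.all[idx.sub, ] %*% Q            # coordinates of the subset
proj.all <- X.all %*% Q                       # same projection applied to all rows

# The subset rows keep exactly the same coordinates
stopifnot(isTRUE(all.equal(proj.sub, proj.all[idx.sub, ])))
```

If the projection re-centred the data on the mean of whichever data set is passed in, the two sets of coordinates would differ by a constant shift, so the check is a useful safeguard.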

We can now visualise both sets of texts in this focus space. We also re-create the plot for the ICE3 varieties for direct comparison.

gma.pairs(ICE3.X9, 1:4, Meta=Meta, select=idx3,
          col=variety, col.vals=simple.pal,
          pch=mode, pch.vals=c(1, 3),
          cex=.4, legend.cex=.8, lim=c(-3.25, 2.75), compact=TRUE)

gma.pairs(ICE3.X9, 1:4, Meta=Meta, 
          col=variety, col.vals=simple.pal,
          pch=mode, pch.vals=c(1, 3),
          cex=.4, legend.cex=.8, lim=c(-3.25, 2.75), compact=TRUE)

There isn’t much difference between the six additional varieties and the ICE3 texts: they nicely fill in the shape sketched by the original data set. A few noticeable shifts remain for Hong Kong and India (top left panel) and for Ireland (top centre panel), all on the conceptual speaking end of the first dimension.

Highlighting just ICE3 vs. other varieties might help pick out smaller differences between the old and new texts more clearly. In order to balance the colours, we divide texts into 3 groups: ICE3, West and Asia.

Meta[, group := factor(
  ifelse(subset == "old", "ICE3",
         ifelse(shortvar %in% qw("GB IRE CAN"), "West", "Asia")),
  levels = qw("ICE3 West Asia"))]
Meta[, table(group, shortvar)]
##       shortvar
## group    NZ  JAM   HK  IND  PHI  SIN  CAN  IRE   GB
##   ICE3  814  904 1110    0    0    0    0    0    0
##   West    0    0    0    0    0    0  964  817  877
##   Asia    0    0    0  672  885  887    0    0    0

There aren’t any striking differences between the three groups of varieties, except for some small local regions that seem to be dominated by one of the groups.

gma.pairs(ICE3.X9, 1:4, Meta=Meta, 
          col=group, col.vals=simple.pal,
          pch=mode, pch.vals=c(1, 3),
          cex=.3, legend.cex=1, lim=c(-3.25, 2.75), compact=TRUE)

save.pdf("ice3_lda_ice9_group_pairs.pdf")

For the paper, the overview scatterplot matrices above are not considered sufficiently readable and intuitive. Hence we create scatterplot rows for the first two LDA dimensions across all nine varieties. As above, we show three varieties each in a single display, divided into the ICE3 varieties, Western varieties, and Asian varieties. Note that we have already saved the first of these plots to a PDF file above. It is repeated here so that we can easily switch between all three displays in the interactive notebook.

scatterplot.rows(ICE3.X9, 1:2, Meta, pch="variety", cols="variety", select=(group == "ICE3"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim[1:2,], grid=TRUE)

scatterplot.rows(ICE3.X9, 1:2, Meta, pch="variety", cols="variety", select=(group == "West"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim[1:2,], grid=TRUE)

save.pdf("ice3_lda_ice9_type_by_var_west.pdf", width=12, height=8)

scatterplot.rows(ICE3.X9, 1:2, Meta, pch="variety", cols="variety", select=(group == "Asia"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim[1:2,], grid=TRUE)

save.pdf("ice3_lda_ice9_type_by_var_asia.pdf", width=12, height=8)

3.2 Comparison with the ICE9 focus space

Having found no striking differences between the three language varieties investigated by Neumann & Evert (2021) and the other six varieties in the ICE9 corpus, we now look at the LDA focus space itself and to what extent it is a product of their choice of varieties.

For this purpose, we create a new LDA focus space based on all texts from the ICE9 corpus. The add.discriminant() method provides a convenient shortcut. Warning: Executing this call more than once will keep adding four LDA dimensions at a time to the focus space. In order to prevent such mistakes, we re-initialise the ICE9 object first (unfortunately, there is no method yet for dropping focus dimensions).

ICE9 <- GMA$new(ZL)
ICE9$add.discriminant(Meta$textcat20, max.dim=4)
ICE9
## GMA object representing projection of 7930 x 41 data matrix into 4-dimensional subspace

Keeping in mind the importance that GMA places on visualisation, we should look at a scatterplot matrix before carrying out further steps (such as applying a rotation). In order to avoid code duplication, this notebook shows the scatterplot only after all steps have been completed, but you can skip down and execute the cell now in order to confirm that the LDA has worked as intended.

ICE9$rotation("pca", dim=1:2)
ICE9$rotation("flip", dim=2)

The visualisation shows that after PCA rotation, the left and right sides of the second dimension are flipped compared to the original analysis. We correct this manually to make the two spaces as comparable as possible. We also check how much of the variation between texts is captured by our focus space.

ICE9$R2()
##       LD1       LD2       LD3       LD4 
## 11.872925  1.606141  3.806092  1.633540
ICE9.X <- ICE9$projection("focus")
gma.pairs(ICE9.X, 1:4, Meta=Meta, col=textcat20, pch=group, 
          col.vals=rainbow.20,
          cex=.2, legend.cex=.7, iso=TRUE, compact=TRUE)

save.pdf("ice9_lda_type_pairs.pdf")

While the first two dimensions appear to be quite similar to those of the ICE3 focus space, the visual impression of the third dimension especially is entirely different. In order to allow for a clearer comparison in the paper, we show only the original ICE3 components and the first row of the scatterplot matrix split into written and spoken texts. (In the notebook, we also plot the other two groups of varieties as overlays.)

scatterplot.rows(ICE9.X, 1:4, Meta, pch="variety", select=(group == "ICE3"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim)

save.pdf("ice9_lda_type_ice3.pdf", width=12, height=8)
scatterplot.rows(ICE9.X, 1:4, Meta, pch="variety", select=(group == "West"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim)

scatterplot.rows(ICE9.X, 1:4, Meta, pch="variety", select=(group == "Asia"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim)

Our first impression is thus that the new LDA focus space is markedly different from the one in our replication experiment. While the first two dimensions appear to be stable, further dimensions strongly depend on the language varieties included. The dimension weights also suggest a similar interpretation for LDA dim 1 and 2, but point in an entirely different direction for LDA dim 3.

ICE9.P <- ICE9$basis("focus")
idx.weights <- apply(abs(ICE9.P), 1, max) >= .1 # only show features with substantial weight
gma.plot.weights(ICE9.P, dim=1:4, feature.names=feature.names, names=paste("LDA dim", 1:4),
                 idx=idx.weights, ylim=c(-.75, .45), zlim=c(-.4, .4))

save.pdf("ice9_lda_weights.pdf", width=8, height=7)

To confirm this impression, we need a quantitative criterion for the similarity or dissimilarity of the two focus spaces. Evert & Neumann (2017) had an easy solution for their one-dimensional focus spaces, using the angle between the single basis vectors of different focus spaces as a simple and intuitive measure. The gmatools package includes a more general measure of subspace similarity \(\text{Sim}_1\) that can be interpreted as the (fractional) number of shared dimensions between the two spaces. Some concrete examples of possible results for the comparison of four-dimensional subspaces A and B might help get a more intuitive grasp of the measure:

  • \(\text{Sim}_1 = 4\) means that A and B are identical (because they share all 4 dimensions).
  • \(\text{Sim}_1 = 3\) is the case if A and B share 3 dimensions, while their fourth dimensions are orthogonal to each other. Note that sharing dimensions doesn’t mean that the orthogonal basis vectors must be identical to begin with: it might be necessary to apply rotations in both spaces to match up basis vectors.
  • \(\text{Sim}_1 = 3.87\) is the case if A and B share 3 dimensions, but their fourth dimensions are oblique at an angle of ca. 30 degrees (because \(\cos 30^{\circ} \approx .87\)). It is also the case if A and B share 2 dimensions, while their third and fourth dimensions are each oblique at an angle of ca. 21 degrees (because \(2\cdot \cos 21^{\circ} \approx 1.87\)).
  • \(\text{Sim}_1 = 3\) is thus also the case if A and B share only two dimensions, while their third and fourth dimensions are oblique at an angle of 60 degrees (because \(\cos 60^{\circ} = \frac12\)).
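
The examples in this list can be reproduced with a few lines of base R. The sketch below assumes that the measure equals the sum of the cosines of the principal angles between the two subspaces, obtained as the singular values of the cross-product of two orthonormal bases. This is consistent with the examples above, but the helper subspace.sim() is hypothetical and not taken from gmatools.

```r
# Hypothetical helper: sum of cosines of the principal angles between two subspaces,
# given orthonormal bases A and B (columns = basis vectors)
subspace.sim <- function (A, B) sum(svd(crossprod(A, B))$d)

E <- diag(6)                 # standard basis of a 6-dimensional toy feature space
A <- E[, 1:4]                # subspace spanned by e1..e4

tilt <- function (deg, from, to) {   # unit vector tilted by deg degrees from axis 'from' towards 'to'
  phi <- deg * pi / 180
  cos(phi) * E[, from] + sin(phi) * E[, to]
}
B30 <- cbind(E[, 1:3], tilt(30, 4, 5))                   # 3 shared dims, 4th tilted by 30 degrees
B60 <- cbind(E[, 1:2], tilt(60, 3, 5), tilt(60, 4, 6))   # 2 shared dims, 3rd and 4th tilted by 60 degrees

subspace.sim(A, A)     # identical subspaces
subspace.sim(A, B30)   # 3 + cos(30 deg)
subspace.sim(A, B60)   # 2 + 2 * cos(60 deg)

# rotating the basis vectors inside a subspace does not change the result
th <- 0.7
R <- diag(4); R[1:2, 1:2] <- c(cos(th), sin(th), -sin(th), cos(th))
subspace.sim(A, B30 %*% R)
```

The last call illustrates the remark about rotations: the value depends only on the two subspaces, not on the particular basis vectors chosen to describe them.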

The similarity() method is a convenient way to compute the subspace similarity between two focus spaces.

ICE9$similarity(ICE3)
## [1] 3.714403

The relatively high similarity value indicates that the two focus spaces might be more alike than our visualisation above suggests. We can also decompose the similarity value into components for shared and oblique dimensions.

tmp <- ICE9$similarity(ICE3, method="sigma")
data.frame(sim=tmp, angle=acos(tmp) * 180 / pi, 
           row.names=sprintf("aligned dim %d", 1:4))
##                     sim     angle
## aligned dim 1 0.9908074  7.774797
## aligned dim 2 0.9597389 16.313552
## aligned dim 3 0.8896091 27.175829
## aligned dim 4 0.8742471 29.044002

In line with our visual impression, two dimensions are very close to the original analysis, while the other two dimensions are oblique at close to 30 degrees. Note that the dimensions shown here do not necessarily correspond to the dimensions of either focus space, but represent two optimally aligned sets of basis vectors in the two GMA spaces.
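
As a quick sanity check, the per-dimension values printed above can be added up again. The sketch assumes that the overall similarity is simply the sum of the "sigma" components, which the printed numbers appear to bear out up to rounding.

```r
sigma <- c(0.9908074, 0.9597389, 0.8896091, 0.8742471)  # values copied from the output above
sum(sigma)                         # compare with the result of ICE9$similarity(ICE3)
4 - sum(sigma)                     # fractional number of dimensions "lost" between the spaces
round(acos(sigma) * 180 / pi, 2)   # angles between the aligned basis vectors, in degrees
```

Read this way, the two focus spaces differ by roughly a quarter to a third of a dimension in total, spread mostly over the third and fourth aligned directions.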

Apparently, the ICE3 and ICE9 focus spaces are more similar than our visualisation has made us believe. It would seem that further rotations of the ICE9 basis are needed in order to bring out the visual similarity. The first two dimensions already match quite well, so the additional rotation will mostly affect LDA dim 3 and 4. Even in the 3-dimensional plots, it is difficult to guess exactly which rotation is called for – and finding it by trial and error is at best a tedious process. Fortunately, gmatools offers functionality to rotate the focus space basis automatically until the best possible match with the ICE3 basis is achieved. This is referred to as a manual rotation because the basis is rotated to match user-specified axis vectors. Conveniently, we can directly specify the ICE3 basis, which is then automatically projected into the ICE9 focus space and re-orthogonalised.

ICE9$rotation("manual", basis=ICE3, debug=TRUE)
## 1) rotation angle phi = 21.52 deg
##    | b[1] - a[1] |^2 = 0.000000
##    preservation of focus space: lost 0 dims
## 2) rotation angle phi = 16.68 deg
##    | b[2] - a[2] |^2 = 0.000000
##    preservation of focus space: lost -8.88178e-16 dims
## 3) rotation angle phi = 33.67 deg
##    | b[3] - a[3] |^2 = 0.000000
##    preservation of focus space: lost 0 dims

Notice that the second dimensions of the two focus spaces appear to correspond more closely to one another than the first dimensions (so they require a smaller rotation angle to become aligned). The scatterplot matrix now reveals a picture that looks much more familiar from the ICE3 analysis.
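
The idea of projecting a target basis into the focus space and re-orthogonalising it can be sketched in base R as follows. This is a simplified stand-in using a QR decomposition on simulated bases, not the sequence of planar rotations that the debug output hints at. The helper align.basis() and the object names are hypothetical.

```r
set.seed(1)
n <- 8; k <- 3
# toy orthonormal bases: B = current focus space, Tgt = target axes from another analysis
B   <- qr.Q(qr(matrix(rnorm(n * k), n, k)))
Tgt <- qr.Q(qr(B + 0.4 * matrix(rnorm(n * k), n, k)))  # similar, but not identical subspace

# Hypothetical helper: re-express the subspace of B with axes as close as possible to Tgt
align.basis <- function (B, Tgt) {
  P <- B %*% crossprod(B, Tgt)   # project the target axes into the subspace spanned by B
  Q <- qr.Q(qr(P))               # re-orthogonalise (Gram-Schmidt order: axis 1 first)
  s <- sign(colSums(Q * Tgt))    # QR leaves signs arbitrary: orient axes like the targets
  sweep(Q, 2, s, "*")
}

B.new <- align.basis(B, Tgt)
max(abs(crossprod(B.new) - diag(k)))            # ~0: the new basis is orthonormal
max(abs(tcrossprod(B.new) - tcrossprod(B)))     # ~0: it spans the same subspace as B
round(acos(pmin(1, colSums(B.new * Tgt))) * 180 / pi, 1)  # remaining angle per axis
```

Because the new basis spans exactly the same subspace, quantities such as the subspace similarity are unaffected; only the coordinate axes used for plotting and interpretation change.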

ICE9.X <- ICE9$projection("focus")
gma.pairs(ICE9.X, 1:4, Meta=Meta, col=textcat20, pch=group, 
          col.vals=rainbow.20,
          cex=.4, legend.cex=.7, iso=TRUE, compact=TRUE)

save.pdf("ice9_ldamatch_type_pairs.pdf")

Again we show the scatterplots in the first row separately for the ICE3 varieties (and the other two groups as overlays), split into written and spoken texts.

scatterplot.rows(ICE9.X, 1:4, Meta, pch="variety", select=(group == "ICE3"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim)

save.pdf("ice9_ldamatch_type_ice3.pdf", width=12, height=8)
scatterplot.rows(ICE9.X, 1:4, Meta, pch="variety", select=(group == "West"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim)

scatterplot.rows(ICE9.X, 1:4, Meta, pch="variety", select=(group == "Asia"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim)

We also visualise feature weights after the alignment rotation.

ICE9.P <- ICE9$basis("focus")
idx.weights <- apply(abs(ICE9.P), 1, max) >= .1 # only show features with substantial weight
gma.plot.weights(ICE9.P, dim=1:4, feature.names=feature.names, names=paste("LDA dim", 1:4),
                 idx=idx.weights, ylim=c(-.75, .45), zlim=c(-.4, .4))

save.pdf("ice9_ldamatch_weights.pdf", width=8, height=7)

Finally, we take a look at differences between the three sets of language varieties in the new matching perspective.

scatterplot.rows(ICE9.X, 1:2, Meta, pch="variety", cols="variety", select=(group == "ICE3"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim[1:2,], grid=TRUE)

save.pdf("ice9_ldamatch_type_by_var_ice3.pdf", width=12, height=8)
scatterplot.rows(ICE9.X, 1:2, Meta, pch="variety", cols="variety", select=(group == "West"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim[1:2,], grid=TRUE)

save.pdf("ice9_ldamatch_type_by_var_west.pdf", width=12, height=8)
scatterplot.rows(ICE9.X, 1:2, Meta, pch="variety", cols="variety", select=(group == "Asia"),
                 pch.vals=rep(c(1, 3, 4), 3), lim=axis.lim[1:2,], grid=TRUE)

save.pdf("ice9_ldamatch_type_by_var_asia.pdf", width=12, height=8)

Observations are very much in line with the previous analyses, which is very reassuring. Interestingly, differences between the varieties seem slightly less pronounced now, perhaps because the LDA across all 9 varieties aims to factor out any differences between varieties within the same text category.

4 Making sense of latent dimensions

4.1 Close reading

Interpretation of feature weights is difficult and can be misleading (cf. the discussion in Neumann & Evert 2021). One possibility is close reading of some texts in selected areas of the focus space. In order to select suitable texts, we need their coordinates (available in our focus space projection matrix ICE9.X) and metadata (so we can e.g. select extreme texts from specific categories).

For example, we can select the most extreme examples of conceptual speaking at the negative end of dimension 1. A suitable threshold can be gleaned from the scatterplots above or from the distribution summaries for each dimension. We index samples by text ID so it’s easier to work with subsets of the data.

summary(ICE9.X)
##       LD1                LD2                LD3                LD4          
##  Min.   :-3.45066   Min.   :-1.68052   Min.   :-1.37644   Min.   :-1.43972  
##  1st Qu.:-0.98608   1st Qu.:-0.23899   1st Qu.:-0.32338   1st Qu.:-0.32371  
##  Median : 0.29177   Median : 0.08729   Median :-0.05100   Median :-0.06458  
##  Mean   : 0.08788   Mean   :-0.01538   Mean   : 0.03974   Mean   : 0.02145  
##  3rd Qu.: 1.37867   3rd Qu.: 0.28164   3rd Qu.: 0.28411   3rd Qu.: 0.29819  
##  Max.   : 2.56051   Max.   : 1.12353   Max.   : 2.89104   Max.   : 2.05328
sample1 <- ICE9.X[, 1] < -3.2
sample1 <- rownames(ICE9.X)[sample1]
ICE9.X[sample1, ]
##                         LD1         LD2         LD3        LD4
## icegb_s1a-095_2   -3.243380  0.06181059  0.49425687 -0.3680570
## icegb_s1a-098_1   -3.412493  0.28186293  0.34916108 -0.1594776
## icegb_s1a-098_2   -3.290204  0.00950255  0.04572556 -0.4898973
## icegb_s1a-099_2   -3.450665  0.18374404 -0.25593367 -0.4532148
## icegb_s1a-100_2   -3.246579 -0.29749853  0.37873145 -0.4150782
## icesing_s1a-099_1 -3.226246  0.12694985  0.07896369 -0.5940627

We might take a closer look at icegb_s1a-095_2, which needs to be obtained from the original ICE corpus. Its detailed metadata are shown in the first row of the table below.

text1 <- "icegb_s1a-095_2"
Meta[sample1, ]
## Key: <id>
##                   id       variety   mode   format short32  textcat32
##               <char>        <fctr> <fctr>   <fctr>  <fctr>     <fctr>
## 1:   icegb_s1a-095_2 Great Britain spoken dialogue   phone phonecalls
## 2:   icegb_s1a-098_1 Great Britain spoken dialogue   phone phonecalls
## 3:   icegb_s1a-098_2 Great Britain spoken dialogue   phone phonecalls
## 4:   icegb_s1a-099_2 Great Britain spoken dialogue   phone phonecalls
## 5:   icegb_s1a-100_2 Great Britain spoken dialogue   phone phonecalls
## 6: icesing_s1a-099_1     Singapore spoken dialogue   phone phonecalls
##         code32 short20                textcat20 code20 short12 textcat12 code12
##         <fctr>  <fctr>                   <fctr> <fctr>  <fctr>    <fctr> <fctr>
## 1: S1A-091-100    conv conversations/phonecalls    S1A    priv   private    S1A
## 2: S1A-091-100    conv conversations/phonecalls    S1A    priv   private    S1A
## 3: S1A-091-100    conv conversations/phonecalls    S1A    priv   private    S1A
## 4: S1A-091-100    conv conversations/phonecalls    S1A    priv   private    S1A
## 5: S1A-091-100    conv conversations/phonecalls    S1A    priv   private    S1A
## 6: S1A-091-100    conv conversations/phonecalls    S1A    priv   private    S1A
##    shortvar  word  sent subset  group
##      <fctr> <int> <int> <fctr> <fctr>
## 1:       GB   261    51    new   West
## 2:       GB   707   144    new   West
## 3:       GB   376    81    new   West
## 4:       GB   892   194    new   West
## 5:       GB   122    29    new   West
## 6:      SIN   877   153    new   Asia

In order to make sense of individual features of such a text, we want to know (i) whether some features are individually extreme and, more importantly, (ii) to what extent they contribute to the position of the text along dimension 1 (i.e. which features “push” the text to the negative end of the dimension).

We obtain the contributions of individual features to the position of each text in dimension 1 by multiplying the standardised feature vectors with the dimension weights (i.e. the coordinates of its basis vector). We can sort the vector to highlight features with the largest contributions (which also stand out in close reading of the text).

w1 <- ICE9.P[, 1] # feature weights in dim 1
ICE9.Contrib1 <- gmatools:::.scaleMargins(ZL, cols=w1) # gmatools has a hidden internal function for scaling columns of a matrix
colnames(ICE9.Contrib1) <- paste(colnames(ZL), ifelse(w1 < 0, " ↓", ""), sep="")
tmp <- sort(ICE9.Contrib1[text1, ])
tmp
##   disc_initial_S ↓    lexical_density    p2_perspron_P ↓    pronoun_all_W ↓ 
##       -0.760877020       -0.397742605       -0.288716314       -0.283929677 
##       pospers1_W ↓           finite_S    p1_perspron_P ↓             word_S 
##       -0.265298267       -0.210633501       -0.192640181       -0.178832740 
##     poss_pronoun_W             prep_W     prep_initial_S           will_F ↓ 
##       -0.111234865       -0.099677276       -0.087455246       -0.083692857 
##          nominal_W     wh_initial_S ↓            atadj_W       imperative_S 
##       -0.064018604       -0.054148388       -0.053484320       -0.051317162 
##      nom_initial_S       past_tense_F    adv_initial_S ↓               it_P 
##       -0.043411193       -0.034059295       -0.034003476       -0.026697709 
##          passive_F         neoclass_W   subord_initial_S       infinitive_F 
##       -0.023608783       -0.019451021       -0.016717577       -0.016617129 
##   verb_initial_S ↓       pospers2_W ↓    subordination_F       modal_verb_V 
##       -0.015593150       -0.013657308       -0.009506192       -0.006010114 
##           verb_W ↓ nonfin_initial_S ↓      place_adv_W ↓               nn_W 
##       -0.005067667       -0.002916518       -0.002888883       -0.001643351 
##            title_W    p3_perspron_P ↓    interrogative_S             np_W ↓ 
##       -0.001593462        0.006466321        0.008643602        0.011554474 
##       time_adv_W ↓   text_initial_S ↓          predadj_W   coordination_F ↓ 
##        0.015444047        0.019354789        0.029224963        0.050814562 
##       pospers3_W ↓ 
##        0.072259378
sum(tmp[c(3,5,7)]) # features relating to 1st/2nd person pronouns
## [1] -0.7466548
sum(tmp[1:8]) # total contribution of features explicitly mentioned in the paper
## [1] -2.57867
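
For readers who prefer not to rely on a hidden internal function, the same kind of contribution matrix can be obtained with base R. The sketch below uses simulated z-scores and made-up weights. It assumes, as described above, that the internal helper simply multiplies each feature column by the corresponding weight.

```r
set.seed(7)
Z <- matrix(rnorm(5 * 4), 5, 4,
            dimnames=list(paste0("text", 1:5), paste0("feat", 1:4)))  # toy z-scores
w <- c(-0.6, 0.5, -0.2, 0.3)                                          # toy weights for one dimension

Contrib <- sweep(Z, 2, w, "*")   # multiply column j of Z by weight w[j]
sort(Contrib["text1", ])         # per-feature contributions for one text, most negative first

# the contributions of each text add up to its coordinate on the dimension
all.equal(rowSums(Contrib), drop(Z %*% w))
```

The final check is what makes the contributions interpretable: each value is the part of the text's dimension score that can be attributed to a single feature.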

A complementary perspective is how the feature contributions compare to those of other texts (for the same feature). Our close-reading interpretation suggested that the chosen text is quite extreme in its use of spoken-language features (viz. the first 8 features in the sorted vector above). We confirm this by determining the quantiles corresponding to the dimension score contributions of these features. For example, we find that the selected text is among the 2% of texts with the highest proportion of discourse markers in sentence-initial position; among the ca. 10% of texts with the highest proportions of first and second person pronouns; and among the 1% of texts with the shortest sentences and the lowest lexical density. Note that whether a quantile corresponds to the lowest or the highest feature values cannot be seen directly from the contributions: the sign of the corresponding feature weight has to be taken into account (negative weights are marked ↓ in the labels).

feature.quantiles <- function (M, groups=NULL) {
  if (is.null(groups)) {
    Q <- apply(M, 2, function (x) rank(x) / length(x))
    rownames(Q) <- rownames(M)
  }
  else {
    stopifnot(length(groups) == nrow(M))
    groups <- as.factor(groups)
    Q <- M
    for (l in levels(groups)) {
      idx <- groups == l
      Q[idx, ] <- feature.quantiles(M[idx, , drop=FALSE])
    }
  }
  Q
}
ICE9.Quant1 <- feature.quantiles(ICE9.Contrib1)
ICE9.Quant1[text1, ][order(ICE9.Contrib1[text1, ])] # show in same order as contributions above
##   disc_initial_S ↓    lexical_density    p2_perspron_P ↓    pronoun_all_W ↓ 
##        0.018348045        0.003909206        0.107755359        0.123518285 
##       pospers1_W ↓           finite_S    p1_perspron_P ↓             word_S 
##        0.098991173        0.032408575        0.097477932        0.007692308 
##     poss_pronoun_W             prep_W     prep_initial_S           will_F ↓ 
##        0.015069357        0.004854981        0.071815889        0.094199243 
##          nominal_W     wh_initial_S ↓            atadj_W       imperative_S 
##        0.052143758        0.185813367        0.086002522        0.194262295 
##      nom_initial_S       past_tense_F    adv_initial_S ↓               it_P 
##        0.093253468        0.031715006        0.245964691        0.327364439 
##          passive_F         neoclass_W   subord_initial_S       infinitive_F 
##        0.252522068        0.079382093        0.302900378        0.068978562 
##   verb_initial_S ↓       pospers2_W ↓    subordination_F       modal_verb_V 
##        0.337957125        0.105044136        0.184741488        0.513240858 
##           verb_W ↓ nonfin_initial_S ↓      place_adv_W ↓               nn_W 
##        0.360781841        0.229319042        0.366960908        0.024779319 
##            title_W    p3_perspron_P ↓    interrogative_S             np_W ↓ 
##        0.306620429        0.893127364        0.792181589        0.554602774 
##       time_adv_W ↓   text_initial_S ↓          predadj_W   coordination_F ↓ 
##        0.722257251        0.530453972        0.806683480        0.902459016 
##       pospers3_W ↓ 
##        0.713430013
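
The caveat about signs can be illustrated on toy data. The sketch uses the same rank-based quantile definition as feature.quantiles() and a made-up feature without ties.

```r
x <- c(0.2, 1.5, -0.7, 3.1, 0.9)       # toy standardised values of one feature (no ties)
q <- function (v) rank(v) / length(v)  # same quantile definition as in feature.quantiles()

q(x)          # quantiles of the feature values themselves
q( 0.5 * x)   # contribution quantiles for a positive weight: same order
q(-0.5 * x)   # contribution quantiles for a negative weight: order is reversed
```

With a negative weight, a contribution quantile close to 0 therefore corresponds to one of the highest feature values, which is how the low quantile for sentence-initial discourse markers above should be read.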

Let us now look at the opposite extreme of the dimension, which characterises conceptual writing. Rather than taking the most extreme written registers, it might be instructive to look at spoken texts with large positive dimension scores (which aren’t all that far away from the overall maximum of 2.5605068).

idx <- Meta$mode == "spoken"
sample2 <- rank(-ICE9.X[idx, 1]) <= 10 # spoken texts with 10 highest dimension scores
sample2 <- rownames(ICE9.X)[idx][sample2] # corresponding text IDs 
cbind(LDA1=ICE9.X[sample2, 1], Meta[sample2, ])
## Key: <id>
##         LDA1                id       variety   mode    format short32
##        <num>            <char>        <fctr> <fctr>    <fctr>  <fctr>
##  1: 1.965070  icecan_s2b-030_1        Canada spoken monologue  broadT
##  2: 1.881724   icegb_s2a-033_1 Great Britain spoken monologue  unscrS
##  3: 1.916082  iceind_s2b-035_1         India spoken monologue  broadT
##  4: 1.911810  iceire_s2b-040_1       Ireland spoken monologue  broadT
##  5: 1.899717   icenz_s2b-004_2   New Zealand spoken monologue  broadN
##  6: 1.911991  icephi_s2b-011_1   Philippines spoken monologue  broadN
##  7: 1.911770 icesing_s2b-008_2     Singapore spoken monologue  broadN
##  8: 1.957290 icesing_s2b-010_2     Singapore spoken monologue  broadN
##  9: 1.988551 icesing_s2b-011_1     Singapore spoken monologue  broadN
## 10: 1.891402 icesing_s2b-011_2     Singapore spoken monologue  broadN
##               textcat32      code32 short20             textcat20 code20
##                  <fctr>      <fctr>  <fctr>                <fctr> <fctr>
##  1:     broadcast talks S2B-021-040  script   scripted monologues    S2B
##  2: unscripted speeches S2A-021-050   unscr unscripted monologues   S2A1
##  3:     broadcast talks S2B-021-040  script   scripted monologues    S2B
##  4:     broadcast talks S2B-021-040  script   scripted monologues    S2B
##  5:      broadcast news S2B-001-020  script   scripted monologues    S2B
##  6:      broadcast news S2B-001-020  script   scripted monologues    S2B
##  7:      broadcast news S2B-001-020  script   scripted monologues    S2B
##  8:      broadcast news S2B-001-020  script   scripted monologues    S2B
##  9:      broadcast news S2B-001-020  script   scripted monologues    S2B
## 10:      broadcast news S2B-001-020  script   scripted monologues    S2B
##     short12  textcat12 code12 shortvar  word  sent subset  group
##      <fctr>     <fctr> <fctr>   <fctr> <int> <int> <fctr> <fctr>
##  1:  script   scripted    S2B      CAN   152    10    new   West
##  2:   unscr unscripted    S2A       GB   677    34    new   West
##  3:  script   scripted    S2B      IND  1444    57    new   Asia
##  4:  script   scripted    S2B      IRE  1985    72    new   West
##  5:  script   scripted    S2B       NZ  1002    37    old   ICE3
##  6:  script   scripted    S2B      PHI   693    25    new   Asia
##  7:  script   scripted    S2B      SIN   488    25    new   Asia
##  8:  script   scripted    S2B      SIN   561    32    new   Asia
##  9:  script   scripted    S2B      SIN   452    22    new   Asia
## 10:  script   scripted    S2B      SIN   491    22    new   Asia
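
The selection pattern used here (rank within a subset, then map back to text IDs) can be tried out on toy data. The sketch also shows an alternative based on sort(), which returns the IDs in descending order of their score, whereas the table above is ordered by text ID. All object names are made up for illustration.

```r
set.seed(3)
score <- setNames(rnorm(12), paste0("text", 1:12))   # toy dimension scores
mode  <- rep(c("spoken", "written"), each=6)         # toy metadata

idx <- mode == "spoken"
# pattern used above: rank within the subset, then pick the corresponding text IDs
top.rank  <- names(score)[idx][rank(-score[idx]) <= 3]
# alternative: sort the subset directly and keep the first few names
top.order <- names(sort(score[idx], decreasing=TRUE))[1:3]

top.rank    # in original (ID) order
top.order   # in descending order of score
```

Both variants select the same texts as long as there are no ties at the cut-off.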

We select the only text from ICE-GB in this sample (icegb_s2a-033_1), which is identified as an unscripted speech (while the other texts are scripted broadcast news and talks). This should make for a particularly enlightening comparison with the phone call above.

text2 <- "icegb_s2a-033_1"

We have already computed feature contributions and quantiles for this dimension, which we can reuse for the text at hand.

tmp <- sort(ICE9.Contrib1[text2, ], decreasing=TRUE)
tmp
##    pronoun_all_W ↓   disc_initial_S ↓    p2_perspron_P ↓    lexical_density 
##       0.2922981711       0.2310593945       0.2201371298       0.2158285766 
##       pospers1_W ↓       pospers3_W ↓   verb_initial_S ↓            atadj_W 
##       0.1914302711       0.1093647967       0.0930739802       0.0842871305 
##             word_S    p1_perspron_P ↓               it_P             prep_W 
##       0.0836418023       0.0783836180       0.0734711672       0.0689021902 
##           finite_S          nominal_W       modal_verb_V     wh_initial_S ↓ 
##       0.0638785325       0.0579021084       0.0538706344       0.0504651767 
##           will_F ↓     prep_initial_S          passive_F             np_W ↓ 
##       0.0445504804       0.0340438537       0.0302557807       0.0281294696 
##      place_adv_W ↓       time_adv_W ↓      nom_initial_S          predadj_W 
##       0.0265779436       0.0219088539       0.0185254681       0.0160342883 
##       pospers2_W ↓           verb_W ↓   text_initial_S ↓               nn_W 
##       0.0092249512       0.0074051146       0.0061455599       0.0008901016 
## nonfin_initial_S ↓       infinitive_F            title_W    subordination_F 
##      -0.0011829645      -0.0013327841      -0.0015934619      -0.0050200955 
##    p3_perspron_P ↓         neoclass_W    interrogative_S   subord_initial_S 
##      -0.0059645128      -0.0076956853      -0.0200097859      -0.0215317137 
##       past_tense_F   coordination_F ↓       imperative_S    adv_initial_S ↓ 
##      -0.0253389328      -0.0368560661      -0.0513171624      -0.0580153443 
##     poss_pronoun_W 
##      -0.0941043527
sum(tmp[c(3,5,6,10)]) # features relating to personal pronouns
## [1] 0.5993158
sum(tmp[1:10]) # total contribution of the 10 features with the largest positive contributions
## [1] 1.599505
sum(tmp[tmp < 0]) # pushback from negative contributions
## [1] -0.3299629

The contributions are less concentrated than for the phone call and are spread over a larger number of features. The first 10 features still push the text quite far towards the positive end of the dimension. Note that there are also a considerable number of features with negative contributions (i.e. indicators of conceptual speaking), but their total contribution is relatively small. The corresponding quantiles are much less extreme than for the phone call, because they are calculated across all texts (including the written ones) rather than just the spoken texts.

ICE9.Quant1[text2, ][order(-ICE9.Contrib1[text2, ])]
##    pronoun_all_W ↓   disc_initial_S ↓    p2_perspron_P ↓    lexical_density 
##         0.97225725         0.71166456         0.83619168         0.77767970 
##       pospers1_W ↓       pospers3_W ↓   verb_initial_S ↓            atadj_W 
##         0.81506936         0.89987390         0.83133670         0.99129887 
##             word_S    p1_perspron_P ↓               it_P             prep_W 
##         0.71576293         0.59741488         0.95882724         0.91172762 
##           finite_S          nominal_W       modal_verb_V     wh_initial_S ↓ 
##         0.66702396         0.88158890         0.90617907         0.74558638 
##           will_F ↓     prep_initial_S          passive_F             np_W ↓ 
##         0.64829760         0.69791929         0.86456494         0.94073140 
##      place_adv_W ↓       time_adv_W ↓      nom_initial_S          predadj_W 
##         0.63266078         0.86935687         0.65094578         0.67427491 
##       pospers2_W ↓           verb_W ↓   text_initial_S ↓               nn_W 
##         0.84464061         0.71727617         0.42736444         0.75220681 
## nonfin_initial_S ↓       infinitive_F            title_W    subordination_F 
##         0.31696091         0.55321564         0.30662043         0.36872636 
##    p3_perspron_P ↓         neoclass_W    interrogative_S   subord_initial_S 
##         0.13877680         0.52408575         0.25372005         0.22591425 
##       past_tense_F   coordination_F ↓       imperative_S    adv_initial_S ↓ 
##         0.26116015         0.19110971         0.19426230         0.12496847 
##     poss_pronoun_W 
##         0.07465322

We can also determine separate quantiles for spoken and written texts, which are considerably more extreme for our chosen text (as expected).

ICE9.Quant1Mode <- feature.quantiles(ICE9.Contrib1, Meta$mode)
ICE9.Quant1Mode[text2, ][order(-ICE9.Contrib1[text2, ])]
##    pronoun_all_W ↓   disc_initial_S ↓    p2_perspron_P ↓    lexical_density 
##         0.99971363         0.84650630         0.94716495         0.92525773 
##       pospers1_W ↓       pospers3_W ↓   verb_initial_S ↓            atadj_W 
##         0.94845361         0.98911798         0.90979381         0.99914089 
##             word_S    p1_perspron_P ↓               it_P             prep_W 
##         0.81500573         0.74369989         0.98840206         0.97193585 
##           finite_S          nominal_W       modal_verb_V     wh_initial_S ↓ 
##         0.66995991         0.97422680         0.94172394         0.86168385 
##           will_F ↓     prep_initial_S          passive_F             np_W ↓ 
##         0.69172394         0.85824742         0.96391753         0.88316151 
##      place_adv_W ↓       time_adv_W ↓      nom_initial_S          predadj_W 
##         0.83190149         0.98281787         0.86397480         0.68470790 
##       pospers2_W ↓           verb_W ↓   text_initial_S ↓               nn_W 
##         0.95031501         0.85652921         0.74369989         0.91638030 
## nonfin_initial_S ↓       infinitive_F            title_W    subordination_F 
##         0.27391180         0.69644903         0.28164376         0.36497709 
##    p3_perspron_P ↓         neoclass_W    interrogative_S   subord_initial_S 
##         0.10395189         0.61941581         0.18241695         0.15650057 
##       past_tense_F   coordination_F ↓       imperative_S    adv_initial_S ↓ 
##         0.21506300         0.10337915         0.11741123         0.18284651 
##     poss_pronoun_W 
##         0.04667812
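
The effect of computing quantiles within groups can be seen on a small made-up example. The sketch uses ave() from base R instead of the loop in feature.quantiles(); the rank-based quantile definition is the same.

```r
x <- c(5, 1, 3, 10, 30, 20, 40)     # toy contributions of one feature
g <- factor(c("spoken", "spoken", "spoken",
              "written", "written", "written", "written"))

round(rank(x) / length(x), 3)                             # quantiles across all texts
round(ave(x, g, FUN=function (v) rank(v) / length(v)), 3) # quantiles within each mode
```

The first text is only moderately high overall, but it has the highest value among the spoken texts. This is the same pattern we observe for the unscripted speech.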

Finally, let us look at the written text categories of creative writing and social letters, which extend far down into the conceptually spoken range of LDA dimension 1. This is very plausible linguistically, but there is a lot of variability and especially creative writing also extends far into the positive (conceptually written) range.

summary(ICE9.X[Meta$textcat20 == "creative writing", 1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.9173 -0.6385 -0.1676 -0.1728  0.2496  1.5162
summary(ICE9.X[Meta$textcat20 == "social letters", 1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.7585 -1.4660 -1.0739 -1.0531 -0.6635  1.7076

Here we are interested in which features in particular make some of these texts show properties of conceptual writing, i.e. we want to look at texts from these two categories with high scores on dimension 1.

idx <- Meta$textcat20 %in% c("creative writing", "social letters")
sample3 <- rank(-ICE9.X[idx, 1]) <= 10 # creative writing / social letter texts with 10 highest dimension scores
sample3 <- rownames(ICE9.X)[idx][sample3] # corresponding text IDs 
cbind(LDA1=ICE9.X[sample3, 1], Meta[sample3, ])
## Key: <id>
##          LDA1                id       variety    mode      format short32
##         <num>            <char>        <fctr>  <fctr>      <fctr>  <fctr>
##  1: 1.3777851  icecan_w2f-013_1        Canada written     printed   creat
##  2: 1.0249817   icegb_w2f-017_1 Great Britain written     printed   creat
##  3: 0.9896523   icehk_w2f-017_7     Hong Kong written     printed   creat
##  4: 1.0161583  iceind_w2f-011_1         India written     printed   creat
##  5: 1.5162160  iceire_w2f-013_1       Ireland written     printed   creat
##  6: 1.4770217  iceire_w2f-015_1       Ireland written     printed   creat
##  7: 1.2786506 icephi_w1b-015_10   Philippines written non-printed  socLet
##  8: 1.4347529  icephi_w2f-003_1   Philippines written     printed   creat
##  9: 0.9913376  icephi_w2f-014_1   Philippines written     printed   creat
## 10: 1.7075683 icesing_w1b-013_1     Singapore written non-printed  socLet
##                    textcat32      code32 short20        textcat20 code20
##                       <fctr>      <fctr>  <fctr>           <fctr> <fctr>
##  1: novels and short stories W2F-001-020   creat creative writing    W2F
##  2: novels and short stories W2F-001-020   creat creative writing    W2F
##  3: novels and short stories W2F-001-020   creat creative writing    W2F
##  4: novels and short stories W2F-001-020   creat creative writing    W2F
##  5: novels and short stories W2F-001-020   creat creative writing    W2F
##  6: novels and short stories W2F-001-020   creat creative writing    W2F
##  7:           social letters W1B-001-015  socLet   social letters   W1B1
##  8: novels and short stories W2F-001-020   creat creative writing    W2F
##  9: novels and short stories W2F-001-020   creat creative writing    W2F
## 10:           social letters W1B-001-015  socLet   social letters   W1B1
##     short12        textcat12 code12 shortvar  word  sent subset  group
##      <fctr>           <fctr> <fctr>   <fctr> <int> <int> <fctr> <fctr>
##  1:   creat creative writing    W2F      CAN  2035    90    new   West
##  2:   creat creative writing    W2F       GB  2025   122    new   West
##  3:   creat creative writing    W2F       HK   545    32    old   ICE3
##  4:   creat creative writing    W2F      IND  1945    76    new   Asia
##  5:   creat creative writing    W2F      IRE  2058    92    new   West
##  6:   creat creative writing    W2F      IRE  1971    89    new   West
##  7:  letter          letters    W1B      PHI   209    18    new   Asia
##  8:   creat creative writing    W2F      PHI  2230    61    new   Asia
##  9:   creat creative writing    W2F      PHI  2570   102    new   Asia
## 10:  letter          letters    W1B      SIN  2019    90    new   Asia

We select icecan_w2f-013_1, which has the fifth highest score among the texts in these two categories. The social letter from ICE-SING at the top of the ranking (icesing_w1b-013_1) is very likely an outlier, given how far it deviates from the distribution of its category.

text3 <- "icecan_w2f-013_1"
Meta[text3, ]
## Key: <id>
##                  id variety    mode  format short32                textcat32
##              <char>  <fctr>  <fctr>  <fctr>  <fctr>                   <fctr>
## 1: icecan_w2f-013_1  Canada written printed   creat novels and short stories
##         code32 short20        textcat20 code20 short12        textcat12 code12
##         <fctr>  <fctr>           <fctr> <fctr>  <fctr>           <fctr> <fctr>
## 1: W2F-001-020   creat creative writing    W2F   creat creative writing    W2F
##    shortvar  word  sent subset  group
##      <fctr> <int> <int> <fctr> <fctr>
## 1:      CAN  2035    90    new   West

As before, we obtain feature contributions that push this text to the positive side of the dimension.

tmp <- sort(ICE9.Contrib1[text3, ], decreasing=TRUE)
tmp
##   disc_initial_S ↓    p2_perspron_P ↓           finite_S    p1_perspron_P ↓ 
##        0.231059395        0.171883930        0.144066045        0.138217552 
##             word_S       pospers1_W ↓    pronoun_all_W ↓     prep_initial_S 
##        0.125574258        0.114603850        0.108595921        0.105052798 
##   verb_initial_S ↓    lexical_density             prep_W     wh_initial_S ↓ 
##        0.068618556        0.055635887        0.052123570        0.050465177 
##           will_F ↓     poss_pronoun_W   text_initial_S ↓          nominal_W 
##        0.039336165        0.039129694        0.034508645        0.033082337 
##      nom_initial_S            atadj_W    adv_initial_S ↓           verb_W ↓ 
##        0.023809558        0.020614849        0.019856218        0.012899872 
##       pospers2_W ↓          passive_F nonfin_initial_S ↓            title_W 
##        0.008073951        0.006222566        0.004168496        0.004096396 
##    subordination_F             np_W ↓               nn_W    p3_perspron_P ↓ 
##        0.002079598        0.001644824        0.000990941       -0.002140256 
##          predadj_W      place_adv_W ↓       pospers3_W ↓   subord_initial_S 
##       -0.002479576       -0.004331385       -0.006021358       -0.007126485 
##       past_tense_F       time_adv_W ↓       infinitive_F         neoclass_W 
##       -0.008284217       -0.010549127       -0.013395654       -0.013462907 
##    interrogative_S               it_P       imperative_S       modal_verb_V 
##       -0.020009786       -0.023701576       -0.034080379       -0.044210021 
##   coordination_F ↓ 
##       -0.048833265
sum(tmp[c(2, 4, 6)]) # personal-pronoun features: p2_perspron_P, p1_perspron_P, pospers1_W
## [1] 0.4247053
sum(tmp[tmp > 0]) # total positive contribution
## [1] 1.616411
sum(tmp[tmp < 0]) # total negative contribution
## [1] -0.238626
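
The positional indices c(2, 4, 6) used above depend on the sort order of tmp, so they would silently pick other features for a different text. A minimal sketch of a name-based alternative follows. It uses a toy vector holding the eight largest values from the listing above, and it assumes plain feature names; if the arrow markers shown in the printout are part of the actual names, the lookup would need those exact strings instead.

```r
# Toy stand-in for the sorted contribution vector `tmp`: the eight largest
# values copied from the output above. Plain feature names are an assumption.
contrib <- c(
  disc_initial_S = 0.231059395, p2_perspron_P  = 0.171883930,
  finite_S       = 0.144066045, p1_perspron_P  = 0.138217552,
  word_S         = 0.125574258, pospers1_W     = 0.114603850,
  pronoun_all_W  = 0.108595921, prep_initial_S = 0.105052798
)

# Select the personal-pronoun features by name instead of by position.
pron <- c("p2_perspron_P", "p1_perspron_P", "pospers1_W")
stopifnot(all(pron %in% names(contrib)))  # fail early on a misspelt name

pron_sum <- sum(contrib[pron])
print(round(pron_sum, 7))
```

Selecting by name keeps the sum tied to the same three features regardless of how the vector is ordered.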

Contrary to what one might expect, the position of the text is not the result of a set of very pronounced “conceptual writing” features pushing against a general “conceptual speaking” character. The total contribution of negative features is rather small (about -0.24, compared with about 1.62 for the positive features).
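
To put a number on "rather small", one could express the negative total as a share of the total absolute contribution. The sketch below is only an illustration: contrib_balance is a hypothetical helper, not a function of gmatools, and the input is a two-element stand-in built from the totals reported above rather than the full contribution vector.

```r
# Hypothetical helper (not part of gmatools): summarise the balance of
# positive and negative contributions in a numeric vector.
contrib_balance <- function(x) {
  pos <- sum(x[x > 0])
  neg <- sum(x[x < 0])
  c(positive  = pos,
    negative  = neg,
    net       = pos + neg,
    neg_share = abs(neg) / (pos + abs(neg)))
}

# Stand-in input: the two totals reported above, used in place of the
# full contribution vector.
bal <- contrib_balance(c(1.616411, -0.238626))
print(round(bal, 3))
```

With the full vector available, the same call would presumably be contrib_balance(tmp). By this measure the negative features account for roughly 13% of the total absolute contribution.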